Numerous variables, including the quantity of salt (or salinity), bacteria levels, dissolved oxygen concentrations, and the amount of debris suspended in the water, are used to determine the quality of water. To assess the water quality in some bodies of water, measurements of the amount of pesticides, herbicides, heavy metals, and other contaminants may also be made.
Even though water quality is determined by scientific measurements, it is not as straightforward as saying "that water is good" or "that water is terrible." Therefore, the decision is often based on the water's intended use—is it for drinking, washing a car with, or something else?
Thus, the aim of this project is to predict the quality of water for drinking purpose. Given parameters such as pH, Hardness, Solids, Chlroamines, Sulfate, Organic_Carbon, Trihalomethaes, and Turbidity, the POTABILITY is predicted.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from warnings import filterwarnings
from collections import Counter
import seaborn as sns
import plotly
import plotly.offline as pyo
import plotly.express as px
import plotly.graph_objs as go
pyo.init_notebook_mode()
import plotly.figure_factory as ff
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score
filterwarnings('ignore')
df = pd.read_csv('water_potability.csv')
df.head()
| ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890456 | 20791.31898 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.990970 | 2.960000 | 0 |
| 1 | 3.716080 | 129.422921 | 18630.05786 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.54173 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.41744 | 8.059332 | 356.886136 | 363.266516 | 18.436525 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.98634 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3276 entries, 0 to 3275 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ph 2785 non-null float64 1 Hardness 3276 non-null float64 2 Solids 3276 non-null float64 3 Chloramines 3276 non-null float64 4 Sulfate 2495 non-null float64 5 Conductivity 3276 non-null float64 6 Organic_carbon 3276 non-null float64 7 Trihalomethanes 3114 non-null float64 8 Turbidity 3276 non-null float64 9 Potability 3276 non-null int64 dtypes: float64(9), int64(1) memory usage: 256.1 KB
df.describe()
| ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 2785.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 2495.000000 | 3276.000000 | 3276.000000 | 3114.000000 | 3276.000000 | 3276.000000 |
| mean | 7.080795 | 196.369496 | 22014.092526 | 7.122277 | 333.775777 | 426.205111 | 14.284970 | 66.396293 | 3.966785 | 0.390110 |
| std | 1.594320 | 32.879761 | 8768.570828 | 1.583085 | 41.416840 | 80.824064 | 3.308162 | 16.175008 | 0.780384 | 0.487849 |
| min | 0.000000 | 47.432000 | 320.942611 | 0.352000 | 129.000000 | 181.483754 | 2.200000 | 0.738000 | 1.450000 | 0.000000 |
| 25% | 6.093092 | 176.850538 | 15666.690300 | 6.127421 | 307.699498 | 365.734414 | 12.065801 | 55.844536 | 3.439711 | 0.000000 |
| 50% | 7.036752 | 196.967627 | 20927.833605 | 7.130299 | 333.073546 | 421.884968 | 14.218338 | 66.622485 | 3.955028 | 0.000000 |
| 75% | 8.062066 | 216.667456 | 27332.762125 | 8.114887 | 359.950170 | 481.792305 | 16.557652 | 77.337473 | 4.500320 | 1.000000 |
| max | 14.000000 | 323.124000 | 61227.196010 | 13.127000 | 481.030642 | 753.342620 | 28.300000 | 124.000000 | 6.739000 | 1.000000 |
df.columns = [col.upper() for col in df.columns]
colors_blue = ["#132C33", "#264D58", '#17869E', '#51C4D3', '#B4DBE9']
colors_dark = ["#1F1F1F", "#313131", '#636363', '#AEAEAE', '#DADADA']
colors_green = ['#01411C','#4B6F44','#4F7942','#74C365','#D0F0C0']
sns.palplot(colors_blue)
sns.palplot(colors_dark)
sns.palplot(colors_green)
Potable water is defined as water that is suitable for human consumption (i.e., water that can be used for drinking or cooking). The term implies that the water is drinkable as well as safe. Drinkable water means it is free of unpleasant odors, tastes and colors, and is within reasonable limits of temperature. Safe water means it contains no toxins, carcinogens, pathogenic micro-organisms, or other health hazards.
# Perform value counts and create a DataFrame
d = pd.DataFrame(df['POTABILITY'].value_counts()).reset_index()
d.columns = ['POTABILITY', 'count']
# Create the pie chart
fig = px.pie(d, values='count', names=['Not Potable', 'Potable'], hole=0.4, opacity=0.6,
color_discrete_sequence=[colors_green[3], colors_blue[3]],
labels={'label': 'POTABILITY', 'count': 'No. Of Samples'})
fig.add_annotation(text='POTABILITY',
x=0.5, y=0.5, showarrow=False, font_size=14, opacity=0.7, font_family='monospace')
fig.update_layout(
font_family='monospace',
title=dict(text='How many samples of water are Potable?', x=0.47, y=0.98,
font=dict(color=colors_dark[2], size=20)),
legend=dict(x=0.37, y=-0.05, orientation='h', traceorder='reversed'),
hoverlabel=dict(bgcolor='white')
)
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()
A simple definition is that it measures how acidic or basic a solution is. Pure water has a neutral pH of 7, which indicates that it is neither acidic or basic. However, the scale ranges from 0 (very acidic) to 14 (very basic). It is normal for water to have a range of between 6.5 and 8.5 on the scale.
# plot histogram
fig = px.histogram(df, x='PH', color="POTABILITY", color_discrete_sequence=[colors_green[3], colors_blue[3]],
marginal="box", nbins=100, opacity=0.7, barmode='group')
# include vertical lines
fig.add_vline(x=6.5, line_width=2, line_dash="dash", line_color=colors_dark[2])
fig.add_vline(x=8.5, line_width=2, line_dash="dash", line_color=colors_dark[2])
# fig.add_vrect(x0=6.5, x1=8.5, fillcolor=colors_blue[4], opacity=0.7, line_width=0)
# include in-chart texts
fig.add_annotation(dict(font=dict(color=colors_dark[2],size=10)), text='Safe for drinking', x=7.5, y=90, showarrow=False)
# include title, xaxis and yaxis labels
fig.update_layout(
font_family='monospace',
title=dict(text='pH Level Distribution',x=0.47,y=0.98,
font=dict(color=colors_dark[2],size=20)),
xaxis_title_text='pH Level',
yaxis_title_text='Count',
)
fig.show()
In scientific terms, water hardness is generally the amount of dissolved calcium and magnesium in water. But in layman's terms, you may notice water hardness when your hands still feel slimy after washing with soap and water, or when your drinking glasses at home become less than crystal clear. It is usually measured in mg/L. General guidelines for classification of waters are: 0 to 70 mg/L (milligrams per liter) is classified as soft; 70 to 150 mg/L as moderately hard; 150 to 300 mg/L as hard; and more than 300 mg/L as very hard.
# plot histogram
fig = px.histogram(df, x='HARDNESS', color="POTABILITY", color_discrete_sequence=[colors_green[3], colors_blue[3]],
marginal="box", nbins=100, opacity=0.7, barmode='group')
# include vertical lines
fig.add_vline(x=70, line_width=2, line_dash="dash", line_color=colors_dark[2])
fig.add_vline(x=150, line_width=2, line_dash="dash", line_color=colors_dark[2])
fig.add_vline(x=300, line_width=2, line_dash="dash", line_color=colors_dark[2])
# include in-chart texts
fig.add_annotation(dict(font=dict(color=colors_dark[2],size=15)), text='Soft<br>(0-70)', x=50, y=120, showarrow=False)
fig.add_annotation(dict(font=dict(color=colors_dark[2],size=15)), text='Moderately Hard<br>(70-150)', x=110, y=120, showarrow=False)
fig.add_annotation(dict(font=dict(color=colors_dark[2],size=15)), text='Hard<br>(150-300)', x=250, y=120, showarrow=False)
fig.add_annotation(dict(font=dict(color=colors_dark[2],size=15)), text='Very Hard<br>(>300)', x=320, y=120, showarrow=False)
# include title, xaxis and yaxis labels
fig.update_layout(
font_family='monospace',
title=dict(text='Hardness Distribution',x=0.47,y=0.98,
font=dict(color=colors_dark[2],size=20)),
xaxis_title_text='Hardness (mg/L)',
yaxis_title_text='Count',
)
fig.show()
Total dissolved solids (TDS) comprise inorganic salts (primarily calcium, magnesium, potassium, sodium, bicarbonates, chlorides and sulfates) and some small amounts of organic matter that are dissolved in water. TDS in drinking water originate from natural sources, sewage, urban runoff, industrial wastewater, and chemicals used in the water treatment process, and the nature of the piping or hardware used to convey the water, i.e. the plumbing. Quantityis excellent if its less than 300 ppm; good, between 300 and 600 ppm; fair, between 600 and 900 ppm; poor, between 900 and 1200 ppm; and unacceptable, greater than 1200 mg/litre. However, Water with extremely low concentrations of TDS may also be unacceptable because of its flat, insipid taste. It's worth noting that water with a lower TDS level is not necessarily safer or of higher quality, as some minerals and salts can be beneficial for health. However, water with a very high TDS level can have a noticeable taste and smell, and some of the dissolved substances may be harmful to health. In such cases, it may need to be treated before it is consumed.
# plot histogram
fig = px.histogram(df, x=df['SOLIDS'], color="POTABILITY", color_discrete_sequence=[colors_green[3], colors_blue[3]],
marginal="box", nbins=100, opacity=0.7, barmode='group')
# include title, xaxis and yaxis labels
fig.update_layout(
font_family='monospace',
title=dict(text='Solids Distribution',x=0.47,y=0.98,
font=dict(color=colors_dark[2],size=20)),
xaxis_title_text='Solids (ppm)',
yaxis_title_text='Count',
)
fig.show()
Chloramines are a type of disinfectant used to treat drinking water. They are formed by adding ammonia to chlorine, which provides a longer-lasting residual in the water distribution system. Chloramines are commonly used as a secondary disinfectant in municipal water treatment plants because they are effective at controlling bacteria, viruses, and other pathogens. The acceptable level of chloramines in drinking water depends on the specific use and the guidelines used. According to the World Health Organization (WHO), the maximum acceptable level of chloramines in drinking water is 4 mg/L or 4 parts per million (ppm). This is a guideline value, and actual levels may vary depending on the local regulations and water treatment practices. It's important to note that some individuals may be sensitive to high levels of chloramines, particularly people with respiratory problems, as well as fish and other aquatic animals. In such cases, it may be necessary to treat the water to remove the chloramines before use.
# plot histogram
fig = px.histogram(df, x='CHLORAMINES', color="POTABILITY", color_discrete_sequence=[colors_green[3], colors_blue[3]],
marginal="box", nbins=100, opacity=0.7, barmode='group')
# include vertical lines
fig.add_vline(x=4, line_width=2, line_dash="dash", line_color=colors_dark[2])
# include in-chart texts
fig.add_annotation(dict(font=dict(color=colors_dark[2],size=15)), text='Safe', x=2, y=100, showarrow=False)
# include title, xaxis and yaxis labels
fig.update_layout(
font_family='monospace',
title=dict(text='Chloramines Distribution',x=0.47,y=0.98,
font=dict(color=colors_dark[2],size=20)),
xaxis_title_text='Chloramines (ppm)',
yaxis_title_text='Count',
)
fig.show()
Sulfate is a naturally occurring anion that can be found in water sources such as rivers, lakes, and groundwater. It is often present in water due to the dissolution of minerals such as gypsum, alunite, and epsomite. The acceptable level of sulfate in drinking water can vary depending on the source and guidelines used. Sulfate levels in drinking water should not exceed 250 milligrams per liter (mg/L). High levels of sulfate in water can cause digestive problems for some people, leading to symptoms such as diarrhea and abdominal discomfort. In addition, water with high sulfate levels can have a bitter or medicinal taste.
# plot histogram
fig = px.histogram(df, x='SULFATE', color="POTABILITY", color_discrete_sequence=[colors_green[3], colors_blue[3]],
marginal="box", nbins=100, opacity=0.7, barmode='group')
# include vertical lines
fig.add_vline(x=250, line_width=2, line_dash="dash", line_color=colors_dark[2])
# include in-chart texts
fig.add_annotation(dict(font=dict(color=colors_dark[2],size=15)), text='Safe', x=200, y=100, showarrow=False)
# include title, xaxis and yaxis labels
fig.update_layout(
font_family='monospace',
title=dict(text='Sulfate Distribution',x=0.47,y=0.98,
font=dict(color=colors_dark[2],size=20)),
xaxis_title_text='Sulfate (mg/L)',
yaxis_title_text='Count',
)
fig.show()
Conductivity is a measure of the ability of water to conduct an electrical current, and is often used as an indirect measure of the total dissolved solids (TDS) in water. The conductivity of water depends on the concentration of ions (such as salts, minerals, and acids) dissolved in the water, and is commonly measured in microsiemens per centimeter (µS/cm). The acceptable level of conductivity in drinking water can vary depending on the source and guidelines used. However, the World Health Organization (WHO) recommends that the conductivity of drinking water should not exceed 2,000 microsiemens per centimeter (µS/cm).
# plot histogram
fig = px.histogram(df, x='CONDUCTIVITY', color="POTABILITY", color_discrete_sequence=[colors_green[3], colors_blue[3]],
marginal="box", nbins=100, opacity=0.7, barmode='group')
# include in-chart texts
fig.add_annotation(dict(font=dict(color=colors_dark[2],size=15)), text='All data points are within<br>the safe region (<2000)',
x=600, y=100, showarrow=False)
# include title, xaxis and yaxis labels
fig.update_layout(
font_family='monospace',
title=dict(text='Conductivity Distribution',x=0.47,y=0.98,
font=dict(color=colors_dark[2],size=20)),
xaxis_title_text='Conductivity (µS/cm)',
yaxis_title_text='Count',
)
fig.show()
Organic carbon in water refers to the presence of organic compounds and substances, such as sugars, starches, cellulose, and other complex carbon-based molecules, that are dissolved or suspended in water. Organic carbon can come from a variety of sources, including natural processes such as plant and animal decay, and human activities such as wastewater discharge and agricultural runoff.
The acceptable level of organic carbon in water can vary depending on the intended use of the water and the regulations that apply. In general, the World Health Organization (WHO) recommends that drinking water should have no more than 10 mg/L of total organic carbon (TOC) to ensure that it is safe and palatable.
# plot histogram
fig = px.histogram(df, x='ORGANIC_CARBON', color="POTABILITY", color_discrete_sequence=[colors_green[3], colors_blue[3]],
marginal="box", nbins=100, opacity=0.7, barmode='group')
# include vertical lines
fig.add_vline(x=10, line_width=2, line_dash="dash", line_color=colors_dark[2])
# include in-chart texts
fig.add_annotation(dict(font=dict(color=colors_dark[2],size=15)), text='Safe', x=5, y=100, showarrow=False)
# include title, xaxis and yaxis labels
fig.update_layout(
font_family='monospace',
title=dict(text='Organic Carbon Distribution',x=0.47,y=0.98,
font=dict(color=colors_dark[2],size=20)),
xaxis_title_text='Organic Carbon (ppm)',
yaxis_title_text='Count',
)
fig.show()
Trihalomethanes (THMs) are a group of chemical compounds that are formed when chlorine or other disinfectants used to treat drinking water react with naturally occurring organic matter in the water. The most common THMs in drinking water are chloroform, bromodichloromethane, dibromochloromethane, and bromoform.
The acceptable level of THMs in drinking water is regulated by various national and international organizations. For example, in the United States, the Environmental Protection Agency (EPA) has established a maximum contaminant level (MCL) for total THMs of 80 micrograms per liter (µg/L) in drinking water.
# plot histogram
fig = px.histogram(df, x='TRIHALOMETHANES', color="POTABILITY", color_discrete_sequence=[colors_green[3], colors_blue[3]],
marginal="box", nbins=100, opacity=0.7, barmode='group')
# include vertical lines
fig.add_vline(x=80, line_width=2, line_dash="dash", line_color=colors_dark[2])
# include in-chart texts
fig.add_annotation(dict(font=dict(color=colors_dark[2],size=15)), text='Safe', x=35, y=80, showarrow=False)
# include title, xaxis and yaxis labels
fig.update_layout(
font_family='monospace',
title=dict(text='Trihalomethanes Distribution',x=0.47,y=0.98,
font=dict(color=colors_dark[2],size=20)),
xaxis_title_text='Trihalomethanes (µg/L)',
yaxis_title_text='Count',
)
fig.show()
Turbidity refers to the cloudiness or haziness of water caused by the presence of suspended particles, such as clay, silt, algae, or other organic matter. Turbidity can reduce the amount of light that passes through the water and can affect its color, taste, and odor.
The acceptable level of turbidity in drinking water is 5 NTU. However, levels below 1 NTU is most ideal.
# plot histogram
fig = px.histogram(df, x='TURBIDITY', color="POTABILITY", color_discrete_sequence=[colors_green[3], colors_blue[3]],
marginal="box", nbins=100, opacity=0.7, barmode='group')
# include vertical lines
fig.add_vline(x=5, line_width=2, line_dash="dash", line_color=colors_dark[2])
# include in-chart texts
fig.add_annotation(dict(font=dict(color=colors_dark[2],size=15)), text='Safe', x=2.5, y=80, showarrow=False)
# include title, xaxis and yaxis labels
fig.update_layout(
font_family='monospace',
title=dict(text='Turbidity Distribution',x=0.47,y=0.98,
font=dict(color=colors_dark[2],size=20)),
xaxis_title_text='Turbidity (NTU)',
yaxis_title_text='Count',
)
fig.show()
The columns that have null values are pH, Sulfate, and Trihalomethanes.
df.isna().sum()
PH 491 HARDNESS 0 SOLIDS 0 CHLORAMINES 0 SULFATE 781 CONDUCTIVITY 0 ORGANIC_CARBON 0 TRIHALOMETHANES 162 TURBIDITY 0 POTABILITY 0 dtype: int64
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='crest_r')
<Axes: >
# subset data according to potability
df[df['POTABILITY']==1].describe()
| PH | HARDNESS | SOLIDS | CHLORAMINES | SULFATE | CONDUCTIVITY | ORGANIC_CARBON | TRIHALOMETHANES | TURBIDITY | POTABILITY | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1101.000000 | 1278.000000 | 1278.000000 | 1278.000000 | 985.000000 | 1278.000000 | 1278.000000 | 1223.000000 | 1278.000000 | 1278.0 |
| mean | 7.073783 | 195.800744 | 22383.991018 | 7.169338 | 332.566990 | 425.383800 | 14.160893 | 66.539684 | 3.968328 | 1.0 |
| std | 1.448048 | 35.547041 | 9101.010208 | 1.702988 | 47.692818 | 82.048446 | 3.263907 | 16.327419 | 0.780842 | 0.0 |
| min | 0.227499 | 47.432000 | 728.750830 | 0.352000 | 129.000000 | 201.619737 | 2.200000 | 8.175876 | 1.492207 | 1.0 |
| 25% | 6.179312 | 174.330531 | 15668.985038 | 6.094134 | 300.763772 | 360.939023 | 12.033897 | 56.014249 | 3.430909 | 1.0 |
| 50% | 7.036752 | 196.632907 | 21199.386615 | 7.215163 | 331.838167 | 420.712729 | 14.162809 | 66.678214 | 3.958576 | 1.0 |
| 75% | 7.933068 | 218.003420 | 27973.236447 | 8.199261 | 365.941346 | 484.155911 | 16.356245 | 77.380975 | 4.509569 | 1.0 |
| max | 13.175402 | 323.124000 | 56488.672410 | 13.127000 | 481.030642 | 695.369528 | 23.604298 | 124.000000 | 6.494249 | 1.0 |
df[df['POTABILITY']==0].describe()
| PH | HARDNESS | SOLIDS | CHLORAMINES | SULFATE | CONDUCTIVITY | ORGANIC_CARBON | TRIHALOMETHANES | TURBIDITY | POTABILITY | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1684.000000 | 1998.000000 | 1998.000000 | 1998.000000 | 1510.000000 | 1998.000000 | 1998.000000 | 1891.000000 | 1998.000000 | 1998.0 |
| mean | 7.085378 | 196.733292 | 21777.490788 | 7.092175 | 334.564290 | 426.730454 | 14.364335 | 66.303555 | 3.965798 | 0.0 |
| std | 1.683499 | 31.057540 | 8543.068788 | 1.501045 | 36.745549 | 80.047317 | 3.334554 | 16.079320 | 0.780284 | 0.0 |
| min | 0.000000 | 98.452931 | 320.942611 | 1.683993 | 203.444521 | 181.483754 | 4.371899 | 0.738000 | 1.450000 | 0.0 |
| 25% | 6.037723 | 177.823265 | 15663.057382 | 6.155640 | 311.264006 | 368.498530 | 12.101057 | 55.706530 | 3.444062 | 0.0 |
| 50% | 7.035456 | 197.123423 | 20809.618280 | 7.090334 | 333.389426 | 422.229331 | 14.293508 | 66.542198 | 3.948076 | 0.0 |
| 75% | 8.155510 | 216.120687 | 27006.249013 | 8.066462 | 356.853897 | 480.677198 | 16.649485 | 77.277704 | 4.496106 | 0.0 |
| max | 14.000000 | 304.235912 | 61227.196010 | 12.653362 | 460.107069 | 753.342620 | 28.300000 | 120.030077 | 6.739000 | 0.0 |
Based on the distributions plotted earlier, the data for the columns with null values is not skewed so we can use the mean to fill the null values. There is no significant difference between the mean values for the columns with null values when the water is potable and when it is not potable. Thus, we can fill with the general mean value.
null_cols = ['PH', 'SULFATE', 'TRIHALOMETHANES']
for col in null_cols:
df[col].fillna(df[col].mean(), inplace=True)
# split dataset into features and label
X = df.drop('POTABILITY', axis=1)
y = df['POTABILITY']
# split dataset into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=0)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
We will employ multiple ml models and choose the top performers for hyperparameter tuning. The choice metric for this project will be the precision score as we want to focus on correctly predicting the cases where the water is truely potable.
models = [('LR', LogisticRegression(max_iter=1000, random_state=42)), ('SVC', SVC(random_state=42)), ('KNN', KNeighborsClassifier(n_neighbors=10)),
('DTC', DecisionTreeClassifier(random_state=42)), ('GNB', GaussianNB()), ('RFC', RandomForestClassifier(random_state=42)),
('GBC', GradientBoostingClassifier(random_state=42)), ('ABC', AdaBoostClassifier(base_estimator=RandomForestClassifier(), random_state=42))]
results = []
for name, model in models:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
score = precision_score(y_test, y_pred)
results.append((name, score))
results.sort(key=lambda a:a[1], reverse=True)
results
[('SVC', 0.7444444444444445),
('RFC', 0.6638655462184874),
('ABC', 0.6528925619834711),
('GBC', 0.6122448979591837),
('KNN', 0.5588235294117647),
('GNB', 0.46788990825688076),
('DTC', 0.4596774193548387),
('LR', 0.0)]
# SVC, 'ABC', 'GBC', 'RFC'
parameters = {
'SVC':{
'model':SVC(random_state=42),
'params':{
'C': [0.1, 1, 10, 100, 1000],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'kernel': ["rbf"]
}
},
'ABC':
{
'model':AdaBoostClassifier(random_state=42),
'params':
{
'n_estimators':[10,50,100,200],
'learning_rate':[0.1, 0.01, 0.001, 0.0001],
'base_estimator':[RandomForestClassifier(), SVC()]
}
},
'GBC':
{
'model':GradientBoostingClassifier(random_state=42),
'params':
{
'learning_rate':[0.0001,0.001,0.01,0.1],
'n_estimators':[100,200,500,1000],
'max_features':['sqrt','log2'],
'max_depth':list(range(11))
}
},
'Random Forest':
{
'model':RandomForestClassifier(random_state=42),
'params':
{
'n_estimators':[10,50,100,200],
'max_features':['auto','sqrt','log2'],
'max_depth':list(range(11))
}
}
}
cv = RepeatedStratifiedKFold(n_splits=5,n_repeats=2)
scores=[]
for model_name,params in parameters.items():
rs = RandomizedSearchCV(params['model'],params['params'],cv=cv,n_iter=20, scoring='precision')
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
rs.fit(X_scaled,y)
scores.append([model_name,dict(rs.best_params_),rs.best_score_])
data=pd.DataFrame(scores,columns=['Model','Parameters','Score'])
data
| Model | Parameters | Score | |
|---|---|---|---|
| 0 | SVC | {'kernel': 'rbf', 'gamma': 1, 'C': 1} | 0.884616 |
| 1 | ABC | {'n_estimators': 100, 'learning_rate': 0.01, '... | 0.645014 |
| 2 | GBC | {'n_estimators': 500, 'max_features': 'sqrt', ... | 0.877455 |
| 3 | Random Forest | {'n_estimators': 200, 'max_features': 'sqrt', ... | 0.777456 |
data.sort_values('Score', ascending=False)
| Model | Parameters | Score | |
|---|---|---|---|
| 0 | SVC | {'kernel': 'rbf', 'gamma': 1, 'C': 1} | 0.884616 |
| 2 | GBC | {'n_estimators': 500, 'max_features': 'sqrt', ... | 0.877455 |
| 3 | Random Forest | {'n_estimators': 200, 'max_features': 'sqrt', ... | 0.777456 |
| 1 | ABC | {'n_estimators': 100, 'learning_rate': 0.01, '... | 0.645014 |
The top two performing models are the GradientBoostingClassifier and the SVC
To further improve the model, we use the voting classifier. A voting classifier is an ensemble learning method in machine learning where multiple models are combined to make predictions. The idea behind a voting classifier is to combine the predictions of several models and make a final prediction by selecting the most frequent class prediction among the models.The main advantage of using a voting classifier is that it can improve the accuracy of predictions by combining the strengths of different models.
For this project, we will use the top two models based on precision score i.e. Gradient Boosting Classifier and Support Vector Classifier.
param = data['Parameters']
model = VotingClassifier(estimators=[
('GBC', GradientBoostingClassifier(random_state=42, **param[3])),
('SVC', SVC(random_state=42, **param[0])),
], voting='hard')
precision = []
feature_importances = []
scaler = MinMaxScaler()
skf = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
skf.get_n_splits(X, y)
for train_index, test_index in skf.split(X, y):
X_train, X_test = X.loc[train_index], X.loc[test_index]
y_train, y_test = y.loc[train_index], y.loc[test_index]
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Fit the voting classifier
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
score = precision_score(y_test, y_pred)
precision.append(score)
# Calculate feature importances for GradientBoostingClassifier
gbc = model.named_estimators_['GBC'] # Access the GBC estimator
feature_importances.append(gbc.feature_importances_)
# Calculate mean precision score
mean_precision_score = np.mean(precision)
print("Mean Precision Score:", mean_precision_score)
Mean Precision Score: 0.9058514937075308
!pip install shap
Requirement already satisfied: shap in c:\users\user\anaconda3\lib\site-packages (0.46.0) Requirement already satisfied: numpy in c:\users\user\anaconda3\lib\site-packages (from shap) (1.24.3) Requirement already satisfied: scipy in c:\users\user\anaconda3\lib\site-packages (from shap) (1.11.1) Requirement already satisfied: scikit-learn in c:\users\user\anaconda3\lib\site-packages (from shap) (1.3.2) Requirement already satisfied: pandas in c:\users\user\anaconda3\lib\site-packages (from shap) (2.0.3) Requirement already satisfied: tqdm>=4.27.0 in c:\users\user\anaconda3\lib\site-packages (from shap) (4.65.0) Requirement already satisfied: packaging>20.9 in c:\users\user\anaconda3\lib\site-packages (from shap) (23.1) Requirement already satisfied: slicer==0.0.8 in c:\users\user\anaconda3\lib\site-packages (from shap) (0.0.8) Requirement already satisfied: numba in c:\users\user\anaconda3\lib\site-packages (from shap) (0.57.1) Requirement already satisfied: cloudpickle in c:\users\user\anaconda3\lib\site-packages (from shap) (2.2.1) Requirement already satisfied: colorama in c:\users\user\anaconda3\lib\site-packages (from tqdm>=4.27.0->shap) (0.4.6) Requirement already satisfied: llvmlite<0.41,>=0.40.0dev0 in c:\users\user\anaconda3\lib\site-packages (from numba->shap) (0.40.0) Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\user\anaconda3\lib\site-packages (from pandas->shap) (2.8.2) Requirement already satisfied: pytz>=2020.1 in c:\users\user\anaconda3\lib\site-packages (from pandas->shap) (2023.3.post1) Requirement already satisfied: tzdata>=2022.1 in c:\users\user\anaconda3\lib\site-packages (from pandas->shap) (2023.3) Requirement already satisfied: joblib>=1.1.1 in c:\users\user\anaconda3\lib\site-packages (from scikit-learn->shap) (1.2.0) Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\user\anaconda3\lib\site-packages (from scikit-learn->shap) (2.2.0) Requirement already satisfied: six>=1.5 in c:\users\user\anaconda3\lib\site-packages (from python-dateutil>=2.8.2->pandas->shap) (1.16.0)
import shap
# Fit the voting classifier to get SHAP values
model.fit(X_train, y_train)
# Specify the feature names
feature_names = ['pH', 'Hardness', 'Solids', 'Chloramines', 'Sulfate', 'Conductivity', 'Organic_carbon', 'Trihalomethanes', 'Turbidity']
# Use the explainer for the specific model type (e.g., Gradient Boosting Classifier)
explainer = shap.Explainer(model.named_estimators_['GBC'], X_train)
# Get SHAP values
shap_values = explainer(X_test)
# Convert X_test into a DataFrame with the new feature names
X_test_df = pd.DataFrame(X_test, columns=feature_names)
# Plot SHAP summary with renamed features
shap.summary_plot(shap_values, X_test_df)
# Assuming feature_names is a list of your original feature names
feature_names = ['pH', 'Hardness', 'Solids', 'Chloramines', 'Sulfate', 'Conductivity', 'Organic_carbon', 'Trihalomethanes', 'Turbidity']
# Calculate mean absolute SHAP values
shap_abs_mean = np.abs(shap_values.values).mean(axis=0)
# Create a DataFrame using the feature names
shap_df = pd.DataFrame({'Feature': feature_names, 'Mean Absolute SHAP Value': shap_abs_mean})
shap_df = shap_df.sort_values(by='Mean Absolute SHAP Value', ascending=False)
print(shap_df)
Feature Mean Absolute SHAP Value 0 pH 0.254358 4 Sulfate 0.238458 1 Hardness 0.189064 2 Solids 0.176974 3 Chloramines 0.146990 5 Conductivity 0.121009 8 Turbidity 0.103085 6 Organic_carbon 0.098062 7 Trihalomethanes 0.094397
plt.figure(figsize=(10, 6))
plt.barh(shap_df['Feature'], shap_df['Mean Absolute SHAP Value'], color='skyblue')
plt.xlabel('Mean Absolute SHAP Value')
plt.title('Feature Importance based on Mean Absolute SHAP Values')
plt.gca().invert_yaxis()
plt.show()
SHAP analysis was finally carried out to understand the overall importance of each feature to the model, and how these features contributed (in magnitude) to each predicted outcome. As seen in the plots above, the SHAP analysis revealed that overall, two very important features to the model were pH and Sulfate. This implies that from our data, on average, the two most important physicochemical properties that determines to a large extent if a water sample is potable or not, are pH and Sulfate.
The pH feature which had the most impact on the model, will be briefly explained, and also how it contributed to predictions.
• pH: This simply represents the level of acidity or alkalinity contained in each water sample. It ranges from 0 to 14, where lower values indicate alkalinity, higher values indicate alkalinity, and 7 is neutral. From the SHAP plot, extreme pH values (tending towards 0 and 14) give rise to negative SHAP values, which in this case represents ”non portability” predictions; while values tending towards the center (7) tend to give rise to positive SHAP values, indicating ”potability” predictions. This is in line with general standards that pure water has a neutral pH of 7, which indicates that it is neither acidic or basic, and hence it is normal for a potable water to have a range of between 6.5 and 8.5 on the scale.
The model was able to predict the potability of water with a high precision score as this the metric of interest. However, it is worthy to note that recall score will be quite low.